
[WIP: Do not review] NanoVDB: VoxelBlockManager CPU port of decodeInverseMaps #2186

Draft
sifakis wants to merge 51 commits into AcademySoftwareFoundation:master from sifakis:vbm-cpu-port

Conversation

@sifakis
Contributor

@sifakis sifakis commented Mar 27, 2026

Draft for CI and diff review. WIP.

@sifakis sifakis changed the title NanoVDB: VoxelBlockManager CPU port of decodeInverseMaps [WIP: Do not review] NanoVDB: VoxelBlockManager CPU port of decodeInverseMaps Mar 30, 2026
sifakis and others added 2 commits April 2, 2026 13:46
Add WenoLeafPtrs<BuildT>, resolveWenoLeafPtrs, and computeWenoStencil as
static __device__ members of VoxelBlockManager<Log2BlockWidth>. These
implement the first phase of a two-function WENO5 stencil gather:
resolveWenoLeafPtrs performs exactly 3 probeLeaf calls (one per axis) to
resolve neighbor leaf pointers; computeWenoStencil fills a caller-provided
array with the 19 global sequential indices using WenoPt<i,j,k>::idx.

voxelOffset arithmetic uses octal notation: NanoVDB leaf layout encodes
(x,y,z) as x*64+y*8+z, so x/y/z strides are 0100/010/1 in octal.
WenoPt<i,j,k>::idx is used throughout to remain independent of any future
re-alignment with OpenVDB's NineteenPt (which uses a different convention).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Adds the ex_voxelBlockManager_host_cuda example demonstrating the CPU and
CUDA VoxelBlockManager implementations, along with design documentation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
sifakis and others added 15 commits April 2, 2026 13:50
Remove DecodeInverseMapsCPUPlan.md (implementation complete) and distill
its non-obvious design decisions into the knowledge base:

- §11: decodeInverseMaps is intentionally single-threaded/stateless;
  caller distributes blocks; contrast with cooperative GPU version.
- §12: mPrefixSum is bypassed for bulk access — recomputing from raw mask
  words via buildMaskPrefixSums is cheaper than unpacking 9-bit fields;
  mPrefixSum is still used for the cross-word offset in Step 5.
- §13: output fill is range-fill + contiguous copy (not scatter) because
  shuffleDownMask produces a sorted compacted array; std::fill/copy caveat
  for alignment when output arrays come from TLS or stack pointers.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Implementation complete; design rationale distilled into VBMImplementationKnowledge.md.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Design reference for the per-block stencil gather kernel: decodes inverse
maps into block-local scratch, then resolves neighbor leaf pointers and
fills N-point stencil index arrays for all active voxels in the block.
WENO5 (N=19, R=3) is the motivating instance; architecture is stencil-agnostic.
Covers GPU inner loop, CPU SIMD batch design (SIMDw=16, probeLeaf dedup),
unified StencilLeafPtrs template, and reach-R generalization considerations.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
- §3: Stencil type as template parameter needs index→offsets direction
  (for-each-slot gather loop), not the offsets→index direction of WenoPt.
  Clarify relationship to BaseStencil/WenoStencil: geometry-only descriptor,
  no accessor coupling.
- §4: Kernel lambda signature std::array<ValueType,K> kernel(const ValueType* u);
  output is homogeneous std::array (not tuple); K=1 degenerates to scalar;
  SoA output layout results[k][BlockWidth] for SIMD efficiency.
- Renumber §4-§8 → §5-§9; update open questions accordingly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
…notes

Adds a self-contained test (lift_test.cpp) exploring a generic SIMD-lifting
abstraction: given a scalar tuple→tuple kernel, liftToSimd<W> produces an
SoA-wide version that loops over W lanes and is the auto-vectorization target.

The motivating kernel is WENO5 normSqGrad (19-point stencil, matching
WenoStencil::normSqGrad from Stencils.h).  The six weno5() calls vectorize
cleanly; godunovsNormSqrd() blocks vectorization in two distinct ways
depending on how it is written:

  1. std::max / bool isOutside ternaries → "control flow in loop"
  2. float sign + fmaxf (no ternaries) → "no vectype for stmt" due to
     GCC's inability to see through std::tuple's recursive-inheritance
     struct layout in GIMPLE alias analysis

INVESTIGATION.md documents all experiments, findings, current blockers,
and proposed next steps (pointer-cache approach, Clang comparison, etc.).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Evangelos Sifakis <esifakis@gmail.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Introduce nanovdb::util::Simd<T,W> (simd_test/Simd.h) — a minimal
header-only SIMD abstraction backed by std::array<T,W> with arithmetic
operators, SimdMask, min/max, and where().  Mirrors the C++26 std::simd
interface for forward compatibility.

Rewrite the WENO5 normSqGrad kernel as a template on T:
- T=float  : scalar __hostdev__ path for GPU (one thread per voxel)
- T=Simd<float,W> : W-wide CPU path (one call per batch)

A single templated godunovsNormSqrd + normSqGrad definition serves both
execution contexts with no #ifdef, structurally matching Stencils.h.
Clang 18 vectorizes the Simd<float,16> instantiation (691 ymm instructions
in the hot function, assembly-verified); GCC 13 does not.

Update INVESTIGATION.md with the full scoreboard, both approaches, and
next steps (GCC intrinsics path, benchmarking, nanovdb/util/ integration).

Signed-off-by: Efstathios Sifakis <esifakis@cs.wisc.edu>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Auto-detect <experimental/simd> (Parallelism TS v2) via __has_include and
__cpp_lib_experimental_parallel_simd.  When available, Simd<T,W> and
SimdMask<T,W> become thin wrappers around fixed_size_simd / fixed_size_simd_mask,
delegating all arithmetic to the standard type.

The TS v2 where(mask, v) is a 2-arg masked-assignment proxy; wrap it into
the 3-arg select(mask, a, b) form expected by the kernels.

Verified with clang++-18 -std=c++26: both paths produce identical assembly
(1275 ymm instructions, PASS on all 16 lanes), confirming Clang optimizes
through the wrapper completely.

Signed-off-by: Efstathios Sifakis <esifakis@cs.wisc.edu>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Document the std::experimental::simd backend alongside the std::array
default, including the TS v2 where() adaptation, the auto-detection
mechanism, and the assembly comparison showing byte-for-byte identical
output between the two backends under Clang 18.

Update the vectorization results table and open questions accordingly.

Signed-off-by: Efstathios Sifakis <esifakis@cs.wisc.edu>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
StencilKernel.h — new prototype header:
  - BaseStencilKernel<T, SIZE>: owns mValues[], mDx2, mInvDx2; no grid coupling
  - WenoStencilKernel<T>: derives from above, provides normSqGrad()
  - WENO5<T> and GodunovsNormSqrd<T, MaskT>: free functions mirroring Stencils.h
  - T=float for GPU scalar path, T=Simd<float,W> for CPU batch path

lift_test.cpp — rewritten to use WenoStencilKernel<T> directly:
  - SIMD and scalar reference paths both instantiate the same class
  - dx passed to constructor; mValues populated via operator[]

Simd.h — refinements:
  - Simd<T,W> and SimdMask<T,W> in Backend A are now pure type aliases for
    stdx::fixed_size_simd / fixed_size_simd_mask (no wrapper struct)
  - element_aligned_tag / element_aligned: portable load/store tag, always
    present; aliases stdx::element_aligned_tag in Backend A, dummy struct in B
  - Backend B load constructor and store() accept element_aligned_tag (defaulted)
  - NANOVDB_NO_STD_SIMD opt-out flag to force Backend B

INVESTIGATION.md — updated:
  - Approach B section updated to reflect class hierarchy instead of free functions
  - Backend B GCC note: the struct-access failure was specific to Approach A's
    liftToSimd outer-lane loop; Backend B's fixed-count operator loops do vectorize
    on GCC when used with the Generic-T class hierarchy
  - New ymm tables for both backends under GCC (Backend A: 1267 total,
    Backend B: 619 total); both pass correctness

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
…chPtrs)

Add full CPU batch neighbor-leaf resolution design to the planning doc:

- §6: Replace StencilLeafPtrs struct with layered design — shared 3×3×3 bit
  encoding for probedMask/ptrs[27], stencil-specific batchPtrs population
  (batchPtrs[4][SIMDw] for WENO5, batchPtrs[3][3][3][SIMDw] for box stencil),
  and GPU scalar design note kept separate.

- §8d: Update lazy-probe section to reference ptrs[27] and 27-bit probedMask;
  add batchPtrs population step (Phase 2) after the probeLeaf loop.

- §8e: Update computeNeededDirs direction table to use 3×3×3 bit positions
  (bits 4,10,12,14,16,22 for WENO5 face neighbors).

- §8f/§8g: Minor notation updates to match ptrs[27] naming.

- §9: Resolve ptrs-layout and nExtraLeaves open questions; add prototype scope.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Standalone CPU-only executable that verifies the neighbor leaf resolution
design from StencilGather.md §8d–§8f:

- For each VBM block: calls decodeInverseMaps, recomputes nLeaves from
  jumpMap, then processes SIMDw=16 batches with the full probedMask /
  lazy-probeLeaf / batchPtrs[4][SIMDw] pipeline.
- Does not call computeStencil.  Instead verifies batchPtrs against a
  direct probeLeaf reference for all 18 non-center WENO5 stencil offsets
  that cross leaf boundaries.
- Passes at 0.1, 0.25, 0.5, 0.9 occupancy (2.3M–2.9M lane checks each).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Refactor computeNeededDirs to accept a pre-expanded Simd<uint32_t,SIMDw>
vector, moving sentinel/masking responsibility to the single gather site
where leafMask is known:

- kSentinelExpanded = expandVoxelOffset(292) = 0x41044104 (constexpr)
- Caller broadcasts sentinel to all lanes, overwrites leafMask lanes with
  real expandVoxelOffset() values before calling computeNeededDirs
- computeNeededDirs is now a pure add+reduce with no masking or cross-check

Carry trick (§8e): expandVoxelOffset packs lz/lx/ly into 6 guarded 3-bit
groups; a single vpaddd ymm × 2 + vpor + vpand + shuffle-tree detects all
six WENO5 directions simultaneously.  kExpandCarryK = 0x514530C3.

AVX2 codegen confirmed via objdump:
- computeNeededDirs: vpbroadcastd + 2×vpaddd ymm + vpor/vpand ymm +
  vextracti128/vpsrldq shuffle-tree, no branches or calls in hot path
- activeMask/leafMask in runPrototype: vpcmpeqd ymm × 4 + vmovmskps ymm × 2
- Sentinel broadcast: 0x41044104 literal → vpbroadcastd → 2×vmovdqa ymm

Always-on scalar cross-check at every computeNeededDirs call site.
verifyComputeNeededDirsSentinel() tests both the sentinel carry property
and the straddle-lane non-pollution scenario before runPrototype().

StencilGather.md §8e and §8f updated to match new API and codegen notes.
Phase 1 prototype marked complete in §9.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Documents the BatchAccessor — the SIMD-batch analog of ValueAccessor —
developed from the ex_stencil_gather_cpu Phase 1 prototype discussion.

Core concept: instead of caching the path to one leaf, cache the full
3×3×3 neighborhood of 27 leaf pointers around the current center leaf,
serving SIMDw voxels per call.

Key design elements documented:

Eviction policy: fires on none_of(leafMask) only — straddle lanes do not
evict. leafMask is the "partial-hit" signal with no scalar-accessor analog.

Prefetch coverage argument:
- WENO5 (R=3): 6 extremal taps (±R,0,0),(0,±R,0),(0,0,±R) are necessary
  and sufficient — equivalent to the computeNeededDirs carry trick
- Box stencil (R=1): 8 corner taps (±1,±1,±1) collectively cover all 26
  non-center directions for any voxel position in the batch

Three-tier API:
- prefetch<di,dj,dk>(vo, leafMask, treeAcc)
- cachedGetValue<di,dj,dk>(vo, leafMask)  — no treeAcc, cache assumed warm
- getValue<di,dj,dk>(vo, leafMask, treeAcc) — lazy combined (vanilla style)

Template <di,dj,dk> rationale vs runtime Coord: compile-time direction bit,
dead-axis elimination, VDB convention alignment; runtime Coord overload
provided for generic stencil adapters.

AVX2 profile: offset arithmetic (vpaddd ymm), lane split (vpcmpgtd ymm),
gather from ≤2 leaf pointers (vgatherdps×2 + vpblendvb) — both scalar
bottlenecks from Phase 1 prototype are eliminated.

StencilGather.md: add cross-reference to BatchAccessor.md.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Introduces BatchAccessor<BuildT,ValueT,VoxelOffsetT,LeafIDT,PredicateT>,
the SIMD analog of NanoVDB's ValueAccessor.  Instead of caching the path
to one leaf, it caches the 27-entry 3×3×3 neighbor pointer table around
the current center leaf and serves a SIMD batch of LaneWidth voxels per
call.

Key design decisions
--------------------
- Eager center: constructor and advance() populate mLeafNeighbors[13]
  directly (O(1), no probeLeaf), so cachedGetValue<0,0,0> is valid
  immediately and the probe loop never needs a center special-case.

- SWAR neededMask: prefetch<di,dj,dk> expands the 9-bit voxel offsets
  into a 15-bit packed form (lz@[0:2], lx@[6:8], ly@[12:14]) using SIMD
  bitwise ops, then adds a compile-time packed stencil offset and checks
  carry bits for crossing detection.  One vpaddw YMM instruction covers
  all 16 lanes; clang folds packed_d into the blend constants at compile
  time, reducing the expand+blend+add to 5 SIMD instructions.

- Heterogeneous where: Simd.h gains where<T,U,W>(SimdMask<U,W>, ...) so
  a PredicateT=SimdMask<float,W> can gate a VoxelOffsetT=Simd<uint16_t,W>
  blend without explicit casting.  Array backend uses a trivial bool loop;
  stdx backend converts via a bool[] round-trip.

- Correctness verified in-process: stencil_gather_cpu.cpp integrates
  BatchAccessor as an alternate execution path and cross-checks all 18
  WENO5 tap directions against direct tree references (12.3M lane checks).

Simd.h additions (array backend)
---------------------------------
- SimdMask<T,W>: converting constructor from SimdMask<U,W>
- Simd<T,W>: operator|, &, ^, <<(Simd), >>(Simd)
- where<T,U,W>: heterogeneous mask overload (both backends)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Simd.h:
- Add scalar_traits<T> / scalar_traits_t<T>: extracts element type from
  plain scalars (identity) and Simd<T,W> (T); used by BatchAccessor
  static_asserts and as the shift-count type for the new uniform-shift ops.
- Add Simd::operator<<(T) / operator>>(T): uniform scalar shift (all lanes
  by the same immediate).  Maps to vpsllw imm8 / vpsrlw imm8 on x86 —
  distinguished from the existing per-lane Simd<<Simd overload which would
  require the nonexistent vpsllvw instruction for 16-bit lanes.

BatchAccessor::prefetch SWAR section:
- Replace hard-coded uint16_t/uint32_t casts with VoxelOffsetScalarT
  (= scalar_traits_t<VoxelOffsetT>) so the code is correct for any
  unsigned 16+-bit instantiation, not just Simd<uint16_t,W>.
- Add class-level static_asserts (unsigned + sizeof >= 2) with explanatory
  messages referencing the SWAR carry-detection contract.
- Remove static_assert(LaneWidth >= 16): was a performance aspiration,
  not a correctness requirement; SWAR works for any LaneWidth >= 1.
- Use 'auto' for expanded / packed_lc / packed_sum (type already expressed
  by the initializer); keep explicit uint32_t for hor_or/hor_and/s where
  the width is a deliberate semantic choice.
- Replace kMask15 hex literal 0x71C7u with 0b111'000'111'000'111u (binary
  makes the three 3-bit mask fields and three 3-bit gaps visually explicit).
- Use vo << VoxelOffsetScalarT(9) (uniform shift) instead of
  vo << VoxelOffsetT(VoxelOffsetScalarT(9)) (broadcast-then-per-lane).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
@danrbailey
Contributor

I know that this is marked as not for review yet, but I just wanted to comment and say that specifically there is one aspect of this PR which is something I've wanted in OpenVDB for a long time. I brought this up in our TSC meetings again recently. A lot of the time when we use the ValueAccessor, it is not true random access, but stencil access patterns and the fact that the ValueAccessor is only able to store one leaf node means you end up thrashing the cache in the ValueAccessor when accessing different neighboring leaf nodes along leaf boundaries.

I have done a little experimentation in this area and the lazy leaf neighbor pattern that you use here is what I was envisioning, though I was contemplating whether a templated stencil argument (i.e. 6-neighbor vs 27-neighbor) might provide improved performance depending on the use case. I also hadn't settled on a name - NeighborAccessor, StencilAccessor, etc.

I would be very interested in comparing performance of a similar structure for OpenVDB that doesn't include the SIMD logic to compare with performance of the ValueAccessor (that also does not use SIMD). If it outperforms the ValueAccessor, I think we should look to switch some of our neighbor use cases across.

The other big issue with the ValueAccessor IMO is the tight coupling with the Grid and Tree (getValueAndCache() etc). This hurts compile times, and a new acceleration structure that extends the ValueAccessor in a new direction might be a good opportunity to revisit this and the attach/release accessor mechanism. I know NanoVDB took a slightly different approach on these trade-offs.

(@kmuseth - a good topic for a future TSC meeting I think)

sifakis and others added 9 commits April 16, 2026 10:33
…lingZeros

- Add 2-argument where(mask, target) = value proxy (stdx and array backends):
  stdx-style masked assignment; encourages GCC to emit vpblendvb
- Add util::reduce(v, op) to both backends: tree-reduces to a scalar with
  std::bit_or<>{} / std::bit_and<>{} etc.; replaces scalar horizontal loop
- Add scalar reduce(T, BinaryOp) identity overload for W=1 path
- Add util::countTrailingZeros(uint32_t) to nanovdb/util/Util.h: __hostdev__,
  CUDA/HIP/__builtin_ctz/MSVC/De Bruijn dispatch; removes ad-hoc ctz from Simd.h
- BatchAccessor: use 2-arg where for packed_lc blend, util::reduce for hor_or/and,
  util::countTrailingZeros in toProbe loop, hoist root ref out of loop body

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Replace the per-lane scalar loop in cachedGetValue with a fully SIMD
gather chain that populates offsets, prefixSums, and maskWords without
any scalar iteration over lanes.

Pipeline (all in Simd<T,W>):
  packed_sum (uint16_t)
    → ×1129 → >>10 → &31        : d_vec (0..26), stays uint16_t —
                                   bits [10:14] of product lie below bit 16
                                   so modular uint16_t multiply is exact
    → gather(mNeighborLeafIDs)   : leaf_id_vec (uint32_t)
    → ×kStride                   : raw_idx (int32_t, null lanes → 0)
    → gather(offset_base)        : mOffset per lane
    → gather(prefix_base)        : mPrefixSum packed, then shift-extract field w
    → gather(mask_word_base + w) : valueMask().words()[w] per lane

Switch mLeafNeighbors[27] (const LeafT*) to mNeighborLeafIDs[27] (uint32_t)
with kNullLeafID = ~uint32_t(0) sentinel, enabling the flat-base SIMD gather
pattern.  prefetch and advance updated accordingly.

Add simd_cast<DstT>(Simd<SrcT,W>) to Simd.h for widening (uint16_t → int32_t,
uint16_t → uint64_t, uint32_t → int32_t) used in gather index construction.

Debug cross-check (#ifndef NDEBUG) validates all three vectors against the
scalar reference path; 12M+ lane checks pass.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
§8a was stale ("prefetch uses a scalar loop") — prefetch has no per-lane
loop. Update with accurate description of its two-phase structure:

- SWAR expansion + vpblendvb + vpaddw: fully in YMM across all LaneWidth
  lanes (vpsllw, vpor, vpand, vpblendvb, vpaddw)
- Horizontal reduction: unavoidable vextracti128 + vpand/vpor tree to
  produce scalar hor_and / hor_or for the per-axis crossing decision

Include the actual Release assembly (ex_stencil_gather_cpu, -O3 -mavx2)
confirming the YMM path survived the mLeafNeighbors → mNeighborLeafIDs
encoding change.

§8c: update nullptr sentinel language to reflect kNullLeafID / valid_u32
mask in the SIMD gather chain.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
- Promote SWAR encoding literals to class-scope uint16_t constants:
  kSwarXZMask (0x1C07), kSwarYMask (0x00E0), kSwarSentinel (4|4<<5|4<<10).
  Shared between prefetch and cachedGetValue; implicit conversion to
  VoxelOffsetT at Simd construction time.
- Add direction-extraction local constants in cachedGetValue:
  kSwarCarryMask (0x6318), kDirMul (1129), kDirMask (31).
- Rename w_vec -> wordIndex for clarity.
- Move U64Traits alias inside #ifndef NDEBUG where it is only used.
- Replace non-ASCII characters (em dashes, arrows, element-of, comparison
  operators) with ASCII equivalents throughout.
- Add explicit parentheses around the d_u16 shift-then-mask chain to make
  operator precedence unambiguous.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
…liases

Simd.h:
- Add scalar simd_cast<DstT>(SrcT) overload (degrades to static_cast).
- Add scalar 2-arg where(bool, T&) masked-assignment proxy matching the
  SIMD WhereExpression form.
- Add scalar gather(const T*, int32_t) and gather_if(T&, bool, ...) to
  complete the scalar overload set alongside the existing where(bool,T,T).

BatchAccessor.h:
- Remove LeafIDT template parameter (was reserved/unused; now derived).
- Add private class-scope LeafIDVecT and LeafDataVecT using conditional_t:
  plain uint32_t/uint64_t when LaneWidth==1, Simd<T,W> otherwise.
  This upholds the convention that scalar instantiations use underlying
  types directly rather than Simd<T,1> wrappers.
- Replace local U32T/U64T aliases in cachedGetValue with class-scope names.

stencil_gather_cpu.cpp:
- Drop the now-removed LeafIDT argument from the BAccT instantiation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
cachedGetValue is now fully vectorised end-to-end for ValueOnIndex grids.
The scalar leaf->getValue(offset) loop is removed; result is filled via a
2-arg where(isActive, result) = ... directly on the output argument so that
leafMask-clear and inactive-voxel lanes are never touched.

Key design decisions recorded in BatchAccessor.md §8e:

- tapLeafOffset_i64 widened to int64_t before *= kStride to avoid uint32_t
  overflow (kNullLeafID = 0xFFFFFFFF causes wild gather indices in uint32_t).
  simd_cast_if<int64_t>(dst, valid_u32, src) writes 0 for invalid lanes,
  keeping gather indices non-negative for vpgatherqq (signed int64_t).

- gather_if gains a MaskElemT template parameter to support heterogeneous
  masks: valid_u32 (SimdMask<uint32_t,W>) applied to uint64_t data fields.

- Activity check: ValueOnIndex::getValue returns 0 for inactive voxels.
  Detected as isActive = (maskWords & (1<<dest_yz)) != 0. Null-leaf lanes
  have maskWords=0 and are therefore implicitly handled by the same check.

- popcount uses a SWAR shift-and-add tree (popcount64 in Simd.h) rather than
  __builtin_popcountll: AVX2 has no 64-bit lane-wise popcount (VPOPCNTQ is
  AVX-512DQ only), and the scalar popcnt instruction is not vectorisable.

- mOffsetBase, mPrefixBase, mMaskWordBase promoted to class-level const
  pointers, computed once in the constructor, shared across all 18
  cachedGetValue instantiations in a WENO5 stencil gather.

Verified: debug build (-O0) + release build (-O3), 12 321 275 lane checks
across all 18 WENO5 non-center taps, zero mismatches.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Add §8f with assembly comparison between GCC 13 and Clang 18
(-O3 -DNDEBUG -march=native, cachedGetValue<1,0,0>, W=16):

- GCC does not inline any Simd.h helper (gather_if, simd_cast, where,
  popcount); emits 14 out-of-line calls and 13 vzeroupper transitions.
  gather_if body uses scalar vmovq/vpinsrq/vinserti128 — no hardware gathers.

- Clang inlines everything except popcount; emits 2 vpgatherdd + 12 vpgatherqq
  hardware gather instructions and only 2 vzeroupper transitions.
  popcount body (88 ymm instrs, pure SWAR vpsrlq/vpand/vpaddq) is also
  fully vectorized but remains out-of-line.

- 43 vpinsrb in Clang output are mask-widening cost for heterogeneous
  gather_if (SimdMask<uint32_t,16> → 4x SimdMask<uint64_t,4>).

Action item added to §10: [[gnu::always_inline]] on Simd.h helpers
would eliminate the GCC regression and fold popcount inline in both
compilers.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
…adeoffs

Expand §8f with analysis of three popcount strategies for the 16-lane
uint64_t case:

- SWAR shift-and-add tree (current): 88 instructions, all AVX2-friendly
  but port-intensive; stays in vector registers throughout.

- Scalar popcnt with extract/reassemble: ~56 instructions; popcnt is
  pipelined (1/cycle throughput on port 1, 3-cycle latency) so 16
  independent lanes retire in ~16 cycles, but the ymm↔GPR domain
  crossing adds ~2 cycles bypass latency per extraction and port 1
  serialises all 16 popcnts.

- vpshufb nibble-LUT + vpsadbw (recommended): ~40 instructions, no
  domain crossing, uses ports 0/5 and 5 (orthogonal to SWAR ports),
  standard compiler-generated AVX2 popcount pattern. Added as action
  item in §10.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Adds StencilAccessor<BuildT,W,StencilT> — a compile-time-parameterized SIMD
wrapper around BatchAccessor that owns the straddling loop, hull-prefetch
sequencing, and per-tap cachedGetValue blending for one VBM block.  Output
is mIndices[SIZE] of Simd<uint64_t,W>, one vector per stencil tap.

Includes Weno5Stencil (18 taps, 6-tap hull) and the findIndex constexpr
fold for compile-time getValue<DI,DJ,DK>() inverse-map lookup.

Also:
- BatchAccessor.h: add centerLeafID() getter used by StencilAccessor
- BatchAccessor.md: expand §8f assembly matrix (compiler × backend × ISA)
- stencil_gather_cpu.cpp: wire StencilAccessor into runPrototype with
  verifyStencilAccessor correctness checks; add rdtsc-based runPerf
- Simd.h: remove = {} default argument from element_aligned_tag overloads
- CLAUDE.md: add project build/architecture reference for Claude Code

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
sifakis and others added 25 commits April 18, 2026 12:11
…util/

All three headers are now shared dependencies across multiple examples and
are no longer specific to a single example directory.  Consolidate into
nanovdb/nanovdb/util/ to match the existing NanoVDB utility layout.

- simd_test/Simd.h → nanovdb/util/Simd.h
- ex_voxelBlockManager_host_cuda/BatchAccessor.h → nanovdb/util/BatchAccessor.h
- ex_voxelBlockManager_host_cuda/StencilAccessor.h → nanovdb/util/StencilAccessor.h

Update all #include paths in BatchAccessor.h, StencilAccessor.h,
stencil_gather_cpu.cpp, and simd_test scratchpad files.

Also add HaloStencilAccessor.md: design document for the dense halo buffer
approach to CPU WENO5 stencil extraction (successor to StencilAccessor).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
…sor.md to nanovdb/util/

Keep design docs co-located with their headers now that Simd.h,
BatchAccessor.h, and StencilAccessor.h live in nanovdb/nanovdb/util/.

VBM-specific docs (StencilGather.md, VBMImplementationKnowledge.md,
VoxelBlockManagerContext.md) remain in ex_voxelBlockManager_host_cuda/.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
INVESTIGATION.md, StencilKernel.h, and lift_test.cpp were investigation
artifacts from the superseded liftToSimd / generic-T approach that preceded
BatchAccessor and Simd.h.  Relevant findings are preserved in
nanovdb/util/BatchAccessor.md.  Untracked scratch files (assembly outputs,
binaries, codegen experiments) deleted alongside.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
The inline prefix-scan / SWAR-popcount investigation block (lines 374-438)
was an algorithm-exploration artifact that reimplemented decodeInverseMaps
internals rather than testing the public API.  Removes the block along with
the now-unused popcount32 helper and the immintrin.h include.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
- New LegacyStencilAccessor.h: scalar ReadAccessor-based reference,
  templated on the same StencilT as StencilAccessor.  Serves as a
  correctness oracle and performance baseline (mirrors the accessor
  path-eviction behaviour of OpenVDB's math/Stencils.h).

- ex_stencil_gather_cpu: rewrite runPerf to multi-thread both paths
  via util::forEach, with an XOR-checksum anti-DCE, nanovdb::util::Timer
  wall-clock timing, and a decodeInverseMaps-only baseline pass.  The
  decode step is ~3 ms of the ~127 ms end-to-end time (~2%), so the
  SIMD-vs-scalar comparison is dominated by the stencil-gather itself.

- examples/CMakeLists.txt: bump ex_stencil_gather_cpu flags from
  -mavx2 to -march=native.

- BatchAccessor.md §8h: full end-to-end codegen investigation on
  i9-285K Arrow Lake, 32 threads, 16.7 M active voxels.  Default GCC 13
  -O3 outlines 14 Simd.h helpers per cachedGetValue and outlines the
  18 per-tap calls from calcTaps, producing ~323 calls and ~282
  vzeroupper per 16-voxel batch — StencilAccessor runs at 7.5 ns/vox,
  losing to scalar LegacyStencilAccessor at 5.4 ns/vox.
  [[gnu::flatten]] on StencilAccessor::moveTo collapses the entire
  call tree into a single 77 KB inlined body: GCC drops to 3.7 ns/vox
  (2x), beating Clang's 4.3 ns/vox.  Flattening only
  BatchAccessor::{prefetch,cachedGetValue} is insufficient (4.9 ns/vox).
  W=8 cuts YMM spills by 86% but regresses GCC end-to-end because
  per-batch framing overhead amortises over fewer lanes.  Attributes
  not applied in the shipped headers — opt-in for StencilAccessor
  consumers that need peak GCC performance.

- StencilAccessor.md §8.1: short GCC codegen note pointing to §8h.

Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
…af / tree-walk)

Decomposes LegacyStencilAccessor's 5.4 ns/voxel end-to-end cost into
three components by running two additional timed passes before the
full Legacy pass:

- framing-only: full Legacy loop structure with no accessor call, using
  the center Coord components as the anti-DCE sum seed.  Measures the
  decodeInverseMaps + Coord-compute + 18-iteration inner loop + store.

- center-hit x 18: calls mAcc.getValue() 18 times per voxel on distinct
  coords that all lie within the center voxel's leaf (an 8x2x1 slab
  parameterised by k in [0..17]).  Every call is a guaranteed leaf-cache
  hit, so there is no tree walk; the distinct coords prevent CSE.
  Measures framing + accessor cache-check + leaf-local lookup.

- full: existing LegacyStencilAccessor path, for reference.

Subtracting gives:
  cache-check + leaf-local lookup = center-hit - framing
  tree-walk cost (amortised over misses) = full - center-hit

Measured breakdown (i9-285K Arrow Lake, 32 threads, 16.7 M active voxels):
  framing       : ~0.25 ns/vox  (5%)
  cache+leaf    : ~0.94 ns/vox  (17%, 0.05 ns/tap — fully load-port pipelined)
  tree walk     : ~4.24 ns/vox  (78%, serialised adjacent-leaf re-descents)

Finding: tree walks on leaf-cache miss dominate LegacyStencilAccessor's
cost despite affecting only ~25% of taps.  Each miss is ~4.5 cycles
(single lower-node re-descent), but they chain serially — unlike the
leaf-local lookups which the OoO engine fully pipelines through the
three load ports.  This is the cost that a 27-leaf neighbour cache
(BatchAccessor-style) eliminates.

Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Replaces BatchAccessor's fully-SIMD 18-step gather chain (§8e) with a
hybrid design that keeps the naturally-SIMD work vectorised and hands
the per-tap leaf lookup to a scalar loop.  Motivated by the end-to-end
codegen analysis in BatchAccessor.md §8f / §8h:

- Without [[gnu::flatten]], GCC outlined the 14 Simd.h helpers inside
  cachedGetValue and the 18 per-tap calls from calcTaps, paying ~323
  function calls and ~282 vzeroupper per 16-voxel batch and losing to
  scalar LegacyStencilAccessor (7.5 vs 5.4 ns/voxel).
- With flatten, GCC ran at 3.7 ns/voxel but as a single 77 KB monolithic
  body per StencilAccessor instantiation — strong dependency on a
  compiler-specific attribute for competitive perf.

Hybrid split (BatchAccessor.md §8i):

Stays SIMD (native __m256i uint16 throughout — no _Fixed<W> aggregate
ABI, no gathers, no heterogeneous where-blends):
- prefetch<di,dj,dk>() — unchanged.
- Setup in cachedGetValue: SWAR expansion, packed_sum, base-32 direction
  extraction (d_u16), packed-layout -> 9-bit local offset extraction
  (localOffset_u16).

Goes scalar (via two util::store harvests and a bitmask leafMask):
- Per-lane pointer chase into mNeighborLeafIDs + mFirstLeaf.
- One leaf.getValue(offset) call per active lane — the LeafNode handles
  valueMask / mPrefixSum / popcount internally.

Public API changes:

BatchAccessor::cachedGetValue signature:
- before: void cachedGetValue(ValueT& result, VoxelOffsetT vo, PredicateT mask)
- after:  void cachedGetValue(ScalarValueT (&dst)[LaneWidth], VoxelOffsetT vo,
                              PredicateT mask)
The C-array output allows the scalar tail to write lane results with a
single mov, eliminating the 18× WhereExpression::operator= outlined body.

StencilAccessor:
- Storage: Simd<uint64_t,W> mIndices[SIZE] -> uint64_t mIndices[SIZE][W],
  now public.  Layout is part of the ABI.
- moveTo() returns void (active-lane info is leafIndex[i] != UnusedLeafIndex,
  already available to the caller — no duplicate mask).
- Removed getValue<DI,DJ,DK>() and operator[]; added a static constexpr
  tapIndex<DI,DJ,DK>() for reorder-safe compile-time named-tap access.
- Public API now contains zero Simd<>/SimdMask<> types.  Callers SIMD-load
  tap rows from mIndices[k] with whatever backend they choose, or iterate
  scalarly — we don't impose a choice.

Simd.h additions:
- util::store(v, p) — uniform store shim dispatching to stdx::copy_to on
  the stdx backend and to Simd::store on the array backend.

BatchAccessor.h additions:
- mFirstLeaf member (cached getFirstLeaf() base) for the scalar tail
  leaf lookup.

End-to-end perf (32 M ambient / 50% / 32 threads, i9-285K Arrow Lake):

  Variant                            GCC ns/vox   Clang ns/vox
  Old SIMD path, no flatten                 7.5            4.3
  Old SIMD path, +flatten on moveTo         3.7            4.3
  Hybrid (this commit), no flatten          5.1            4.9
  Hybrid +flatten on moveTo                 4.8            4.8
  LegacyStencilAccessor                     5.5            6.7

Without flatten, the hybrid is 31% faster than the old SIMD path on GCC
and within 0.2 ns/voxel on Clang, eliminating the 3x compiler spread
that §8f / §8h documented and making both compilers beat the scalar
LegacyStencilAccessor oracle.

The residual 1.4 ns/voxel gap on GCC (hybrid 5.1 vs SIMD+flatten 3.7)
is recoverable by re-applying flatten at the caller site — but the
shipped code no longer requires it for acceptable performance.

Correctness: 12M lane-checks across all 18 WENO5 taps pass against the
LegacyStencilAccessor oracle; XOR checksum matches across the full
16.7 M active voxel workload.

Docs:
- BatchAccessor.md §8e marked "historical"; §8i added with the hybrid
  design, perf matrix, and cleanup notes (gather/simd_cast/popcount
  helpers in Simd.h are no longer exercised and can be removed in a
  follow-up).
- StencilAccessor.md §8.1 rewritten to describe the Simd-free public
  API and the new caller pattern.

Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
…lue: decomposition infra

Adds benchmarking infrastructure used to decompose the stencil-gather
pipeline into its real cost components via perf PMU counters.  None of
this ships in the default code path; it lives alongside the hybrid design
to support follow-up perf analysis and to preserve decomposition variants
that might be reused later.

BatchAccessor:
- Add mFirstLeaf cached base pointer (grid.tree().getFirstLeaf()) so the
  scalar tail in cachedGetValue can do leaf lookups without going back
  through mGrid.tree() each time.
- Add cachedGetValueInLeaf<di,dj,dk>: benchmarking-only variant of
  cachedGetValue that wraps taps to the center leaf via a mod-8 SWAR
  mask (kSwarFieldMask = 0x1CE7).  Used to isolate the hybrid's
  single-leaf floor cost with no cross-leaf variety.  Requires (di,dj,dk)
  in [0,7] per axis.

StencilAccessor:
- Add moveToInLeaf: counterpart to moveTo that calls cachedGetValueInLeaf
  instead of cachedGetValue.  prefetchHull is skipped (center leaf is
  always cached after construction/advance).  Benchmarking only — results
  have no geometric meaning.

ex_stencil_gather_cpu:
- Add CLI flags:
    --pass=<name>       run one pass in isolation.  Supported names:
                        framing, decode, center-hit, stencil, degenerate,
                        inleaf, legacy, legacy-branchless, all (default),
                        verify.  Needed for clean perf-stat attribution —
                        the default harness runs every variant back-to-back
                        and perf cannot restrict counters to a subrange.
    --threads=<n>       gate TBB parallelism via tbb::global_control.
                        Single-threaded runs give cleaner perf attribution;
                        taskset alone does not limit TBB's worker-thread
                        count.
- Add DegenerateStencil: 18 identical (0,0,0) taps.  Under CSE the
  per-tap work collapses to 1 evaluation per lane — isolates the hybrid's
  pipeline framing cost when the compiler can factor out all tap work.
- Add InLeafStencil: 18 distinct compile-time taps in [0,6] per axis.
  Combined with moveToInLeaf, exercises the full hybrid pipeline with
  18 distinct per-tap work items that all hit the center leaf — isolates
  the "same-leaf no-CSE" floor.
- Add four new timed passes (with anti-DCE checksums):
    degenerateUs            DegAccT .moveTo()
    inLeafUs                InLeafAccT.moveToInLeaf()
    framingUs               Legacy loop with no accessor call
    centerHitUs             Legacy loop with 18 distinct same-leaf coords
                            (CSE-defeated)
    legacyBranchlessUs      Legacy loop with leaf.getValue() replaced by
                            the unconditional formula
                            mOffset + prefix + popcount(maskWord & mask-1)
                            — drops the valueMask.isOn branch, giving a
                            3x speedup.
- decodeUs baseline already existed; kept.
- Each pass is guarded by wantPass(name) so --pass=<name> runs exactly
  one variant in isolation.
- Output extended: per-pass ns/voxel lines + a Legacy cost decomposition
  block (framing / cache+leaf / tree walk).

Architectural summary: the decomposition revealed that the true CPU
bottleneck is unpredictable branch mispredicts on the valueMask.isOn()
check inside LeafNode<ValueOnIndex>::getValue(offset), not L1 pressure
or tree-walk latency.  Detailed perf matrix, revised attribution, and
follow-up implications are documented in BatchAccessor.md §8j and
StencilAccessor.md §8.2 (separate commit).

Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
…d attribution

Captures the findings of the perf-counter investigation made possible by
the decomposition infrastructure in the prior commit.  Preserves prior
claims as history (with targeted "Revision note" annotations pointing to
the superseding section), adds the new authoritative measurements and
decomposition, and updates the current-API sections to match the shipped
hybrid design.

BatchAccessor.md:

- §8g, §8h, §8i: prepend "Revision note" blocks pointing to §8j.  The
  original text is preserved intact — only a short header note flags
  which claims were subsequently refined.  Affected claims:
    * §8g cycle budget model: gather chain assumed to dominate; actually
      pipelined away by OoO, real dominant cost is branch mispredicts.
    * §8h "multi-leaf L1 pressure accounts for cross-leaf overhead":
      refuted, L1 miss rate is flat across all variants.
    * §8i "27-leaf cache eliminates tree walks" magnitude overstated;
      measured savings are ~0.3 ns/voxel, not ~3-4 ns/voxel.

- §8j (new, ~230 lines): full perf-counter chronicle.  Nine subsections:
    8j.1 Motivation
    8j.2 Methodology (--pass/--threads flags, PMU setup, P-core pinning)
    8j.3 Measurement matrix for 6 variants (IPC, branch-miss rate,
         L1 miss rate, ns/voxel on a single P-core single-threaded)
    8j.4 Identifying valueMask.isOn(offset) as the real source
    8j.5 Branchless experiment (legacy-branchless pass) — 3x speedup
         on Legacy from 5.6 to 2.0 ns/voxel at 32 threads
    8j.6 Revised 5.4 ns/voxel attribution: ~3.6 isOn mispredicts,
         ~0.75 leaf-local work, ~0.3 tree-walk diff, ~0 L1 pressure
    8j.7 NANOVDB_USE_INTRINSICS is a no-op on GCC 13 at -O3 -march=native
         (SWAR pattern-matched to popcnt) — enable for portability
    8j.8 Architectural implications (branchless LeafNode::getValue
         proposal, revised halo value proposition, hybrid rationale)
    8j.9 Historical correction log — single-table index of revised
         claims with their source sections

- §10 Status: add the hybrid refactor and §8j findings to Completed;
  add "branchless LeafNode<ValueOnIndex>::getValue" as the top Remaining
  item — the single biggest cheap CPU-side speedup available.

StencilAccessor.md:

- §8.2 (new): summary of the PMU finding pointing to BatchAccessor.md
  §8j.  Contains the headline measurement matrix, branchless-experiment
  delta, revised 5.4 ns/voxel attribution, and consequences for design
  decisions.

- §9: rewrite to document the current API (tapIndex<DI,DJ,DK>() +
  public mIndices[SIZE][W]).  Prior getValue<>()/operator[] API was
  removed in the hybrid refactor; that evolution is noted in-section.

- §10 Caller-side usage pattern: rewrite to match current API — moveTo
  returns void, active-lane check via leafIndex[i] != UnusedLeafIndex,
  three direct mIndices access idioms (scalar iteration, SIMD row load,
  compile-time named-tap access).

- §11 Ownership summary: mIndices[SIZE] -> [SIZE][W]; tap-fold row
  updated to reflect direct writes (no where-blend).

- §12 Design decisions 1 and 3: preserve original rationale, append
  "Evolution" / "Revised" notes explaining the shift to void moveTo
  return (decision 1) and removal of operator[] (decision 3) in favour
  of public mIndices + tapIndex<>().  Decisions 2, 4-6 unchanged.

Net: +550 lines of documentation capturing the investigation as a
chronicle, with every revised claim traceable to both the superseded
prior section and the measurement that refined it.

Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
…ly ReadAccessor

Adds a branchless sibling to LeafData<ValueOnIndex>::getValue, and switches
the CPU stencil-gather test scaffolding to use a leaf-only ReadAccessor
variant.  Motivated by the PMU-counter investigation in BatchAccessor.md
§8j: the single data-dependent branch inside LeafData::getValue --
  if (!(word & bit)) return 0;
-- is the dominant cost for hot stencil loops over OnIndex trees, driving
branch-mispredict rates to 8-10% on random-access workloads (IPC ~2) and
still to ~1.7% on spatially-coherent narrow-band workloads (IPC ~4.2
vs a potential ~5.5).

NanoVDB.h: LeafData<ValueOnIndex>::getValueBranchless(uint32_t i)

  Same semantics as getValue -- returns 0 for inactive voxels, otherwise
  mOffset + prefix9(n) + popcount(w & (bit - 1u)) -- but replaces the
  early-return guard with an arithmetic mask gate:
    uint64_t mask = (w & bit) ? ~0ull : 0ull;  return mask & sum;
  The ternary over two constant arms compiles to test + cmov on x86, not
  a conditional jump.  Verified on GCC 13 at -O3 -march=native.

  Semantics are bit-for-bit identical to getValue (checksums match across
  both synthetic random-50% and real narrow-band workloads including OFF
  voxels).

  Scoped to LeafData<ValueOnIndex> (not LeafNode) as an opt-in expert
  path for neighbourhood-aware hot loops (BatchAccessor-class cachers,
  HaloStencilAccessor-style precomputation).  The generic LeafNode and
  ReadAccessor APIs are unchanged.

LegacyStencilAccessor: use ReadAccessor<BuildT, 0, -1, -1>

  For GetValue workloads the upper/lower cache slots of DefaultReadAccessor
  are never consulted (OpT::LEVEL == 0 falls straight to mRoot on a leaf
  cache miss; NanoVDB.h:5387) and are only passively written during the
  walk.  Switching to a 1-level leaf-only cache removes that passive
  bookkeeping.

  Measured effect: Legacy path sees no change (its pipeline is already
  stalled on branch mispredicts, extra stores overlap for free).  The
  branchless path consistently improves by 4-9% -- when IPC is near
  peak, every retired instruction counts.

ex_stencil_gather_cpu updates
  - legacy-branchless pass now calls leaf->data()->getValueBranchless
    instead of the hand-inlined formula it used during investigation.
    The checksum now matches LegacyStencilAccessor exactly (the hand-
    inlined version wrote non-zero garbage for OFF voxels); the new
    method is semantically a drop-in replacement.
  - center-hit and legacy-branchless passes both use the leaf-only
    ReadAccessor construction.

Measured speedup (32 M ambient / 50% random occupancy / i9-285K, 24
cores wall clock):
  Legacy:              94.5 ms
  Legacy branchless:   34.3 ms   (2.8x)

Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
…input

Counterpart to ex_stencil_gather_cpu, which uses a procedurally generated
random-occupancy domain (a pathological worst case for branch prediction
inside LeafData::getValue).  This example loads an openvdb level-set
FloatGrid from disk, converts it to a NanoVDB ValueOnIndex topology grid
plus a separately-allocated float sidecar, and runs the same perf-
decomposition battery on a workload with realistic spatial coherence.

Purpose: quantify how much of the isOn-branch-mispredict cost (see
BatchAccessor.md §8j) survives on real narrow-band traversals where
consecutive tap queries mostly land on active voxels.  Findings to be
documented in the follow-on commit.

Pipeline:
  openvdb::io::File(path)                                 -- disk load
  -> openvdb::FloatGrid                                   -- first FloatGrid,
                                                             or --grid=<name>
  -> nanovdb::tools::CreateNanoGrid<openvdb::FloatGrid>   -- builder
       .getHandle<ValueOnIndex, HostBuffer>(channels=0)   -- topology only
       .copyValues<ValueOnIndex>(sidecar.data())          -- float sidecar
  -> validateSidecarOrdering()                            -- 1000-voxel
                                                             sanity check
  -> runPrototype + runPerf (identical to ex_stencil_gather_cpu)

The separate-sidecar path avoids the double conversion that would
otherwise be needed to opt out of the grid's blind data.  Sidecar ordering is
verified at startup by comparing floatGrid.getValue(ijk) against
sidecar[indexGrid.tree().getValue(ijk)] on 1000 random active voxels
(opt-out via --skip-validation).

The sidecar is plumbed through but not yet consumed by any stencil path
-- it exists for future "fetch values via the sidecar" work.

CLI: narrowband_stencil_cpu <path.vdb>
                            [--grid=<name>]
                            [--pass=<name>]       // same set as
                                                 //   ex_stencil_gather_cpu
                            [--threads=<n>]
                            [--skip-validation]

Both the legacy-branchless pass (calls LeafData::getValueBranchless,
introduced in the prior commit) and the center-hit / legacy-branchless
accessor construction (leaf-only ReadAccessor<BuildT, 0, -1, -1>) match
the updated ex_stencil_gather_cpu conventions.

Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
…nd findings, leaf-only accessor

Follow-on chronicle for the investigation in §8j.  Three things happened
since: (1) the hand-inlined branchless variant was promoted into
NanoVDB.h proper as LeafData<ValueOnIndex>::getValueBranchless; (2) a
real narrow-band benchmark (ex_narrowband_stencil_cpu, taperLER.vdb) was
added to validate the finding on a workload with realistic spatial
coherence; (3) we noticed the scaffolding was using a 3-level ReadAccessor
where only the leaf-level cache can contribute and switched to
ReadAccessor<BuildT, 0, -1, -1>.

BatchAccessor.md §8k (new, ~170 lines) — the new findings:

- §8k.1 The new API: getValueBranchless code, semantics (identical to
  getValue; checksum matches byte-for-byte), scope decision (LeafData
  only, not LeafNode/ReadAccessor), note that the shipped version
  includes the OFF-returns-0 mask gate that the hand-inlined
  benchmarking variant omitted.
- §8k.2 The new ex_narrowband_stencil_cpu benchmark — .vdb loading,
  topology-only IndexGrid + separately-allocated float sidecar path,
  startup sidecar validation.
- §8k.3 Measurement matrix (single-P-core, PMU counters): Legacy vs
  getValueBranchless on synthetic vs narrow-band.  Refines §8j:
  narrow-band is NOT pathological for branch prediction (1.74% miss
  rate, IPC 4.22); random-access is (8.07% miss rate, IPC 1.96).
  getValueBranchless recovers near-peak IPC on both.
- §8k.4 Accessor cache-level finding: 3-level ReadAccessor's upper and
  lower cache slots are never consulted for GetValue.  Switching to a
  1-level leaf-only accessor gives 4-9% additional speedup on branchless
  paths (no measurable effect on Legacy, which is backend-bound on
  mispredicts).  Scope caveat: benchmark-only; the library default stays
  at 3-level for mixed workloads.
- §8k.5 Updated end-to-end headline numbers: 24-core Arrow Lake,
  narrow-band 85 → 60 ms (1.4×), synthetic 95 → 34 ms (2.8×).
- §8k.6 What the §10 Remaining list should now show as complete, and
  future follow-ons (ProbeValue variant that reuses getValueBranchless,
  steering-team proposal).

BatchAccessor.md §10 "Completed" updated with three new bullets:
- getValueBranchless shipped (§8k)
- ex_narrowband_stencil_cpu shipped (§8k.2)
- Leaf-only accessor switch in benchmark scaffolding (§8k.4)
Removed the "branchless LeafNode::getValue" item from "Remaining" (done,
at LeafData level per scope decision).

StencilAccessor.md §8.2 updated: the "cheap architectural win" bullet
now points at the shipped getValueBranchless rather than the "would be"
drafting.  The HaloStencilAccessor bullet reframed: its remaining
advantage is "zero per-tap work at query time" now that the mispredict
storm is already addressable by getValueBranchless.  Cross-references
§8j and §8k.

Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Fold the branchless reformulation (previously shipped as a sibling
getValueBranchless in 8a24ddf) into getValue as its default body, and
gate the pre-2026 branchy form behind NANOVDB_USE_BRANCHY_GETVALUE.
The branchless form is strictly faster or within ~0.1 ns/vox on every
workload measured, and the OFF path is preserved bit-for-bit via the
ternary-constant mask-AND -- checksums match.

Update the two CPU stencil-gather examples to call the single getValue
entry point, and refresh BatchAccessor.md §8k + StencilAccessor.md §8.2
to describe the toggle-based final API and the API evolution history.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
…aths

The 8-pass decomposition battery (decode, stencil, degenerate, inleaf,
framing, center-hit, legacy, legacy-branchless) was load-bearing while
attributing Legacy's cost between tree-walks, leaf-local work, and
isOn mispredicts (BatchAccessor.md §8g-§8k).  That attribution is now
settled and getValue is branchless by default, so the diagnostic
variants no longer earn their keep.

Retain four passes in both ex_stencil_gather_cpu and
ex_narrowband_stencil_cpu:
  - decode    : decodeInverseMaps cost baseline
  - framing   : loop + coord compute + anti-DCE, no accessor call
  - stencil   : hybrid StencilAccessor (SIMD cache + scalar getValue tail)
  - legacy    : LegacyStencilAccessor (per-tap probeLeaf + getValue)

Drop: DegenerateStencil / InLeafStencil types and their accessor aliases,
the degenerate/inleaf/center-hit/legacy-branchless blocks, and the
"Legacy cost decomposition" printout that consumed center-hit.

Correctness unaffected (StencilAccessor vs LegacyStencilAccessor
cross-validation still runs and passes); checksums match between the
two shipped paths on both benchmarks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Added a tap-outer, voxel-inner variant of the Legacy path (same
leaf-only ReadAccessor, same probeLeaf + getValue mechanics, just the
nested loops swapped) as a new `legacy-transposed` benchmark pass in
both examples.  Checksums match the voxel-outer `legacy` pass byte-
for-byte on both synthetic and narrow-band workloads.

During the experiment we hit a GCC inlining pitfall: a runtime-args
inner lambda `[&](int di, int dj, int dk)` invoked 18 times via
parameter-pack fold did *not* get inlined (lambda body contains a
128-iteration loop; GCC's inline budget × 18 is exhausted).  Result:
18 explicit call instructions to a 542-byte processTap function with
6-register prologue/epilogue per call, plus tap offsets becoming
runtime register arguments (one spilled to stack) — accounting for
~13 % of the observed slowdown vs. Legacy.  The fix is a templated lambda

`[&]<int DI, int DJ, int DK>() [[gnu::always_inline]]` dispatched via
`.template operator()<...>()`.  Standalone processTap symbol vanishes;
transposed body grows 4.4 → 9.8 KB, matching Legacy's 10.5 KB.

Measured at ~32M active voxels on i9-285K (24 threads):
  - Narrowband taperLER: Legacy 2.2 ns/vox vs Transposed 2.1 ns/vox
    (marginal, within the ~10 % noise floor)
  - Synthetic 64M/50%:   Legacy 2.4 ns/vox vs Transposed 2.8 ns/vox
    (+19 %, outside noise)

Implementation verdict: LegacyStencilAccessor's voxel-outer moveTo
stays the default.  Tap-outer has no consistent perf advantage, and
voxel-outer wins on cleanliness (self-contained accessor, no scratch
arrays, no compiler-inlining fragility, 1:1 mapping to the stencil-
operator mental model).  `legacy-transposed` kept as a benchmark pass
for future reference.

BatchAccessor.md §8l captures the experiment, the inlining-pitfall
lesson, the measurement matrix, and the implementation-quality
rationale behind keeping voxel-outer as the production default.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
…WENO pipeline target

Our Weno5Stencil in StencilAccessor.h was missing the center tap
<0,0,0>, defining only the 18 off-center ±1,±2,±3 per-axis taps.  The
canonical WenoStencil in nanovdb/math/Stencils.h has 19 taps with
WenoPt<0,0,0>::idx = 0, followed by x-axis (1..6), y-axis (7..12),
z-axis (13..18).  Fixed our ordering to match exactly, so any caller
using the canonical WenoPt index convention gets the same slot layout
as our Weno5Stencil::Taps.

The Hull is unchanged: the center never crosses a leaf, so it contributes
nothing to the face-neighbour prefetch set (§4b monotonicity
argument).

Code impact is data-driven (SIZE = tuple_size_v<Taps>): the
StencilAccessor/LegacyStencilAccessor mIndices storage grows from
[18][W] to [19][W], the parameter-pack fold in fillTaps now expands to
19 calls, and the tap-outer `legacy-transposed` pass iterates 19 inner
loops.  No hard-coded 18s in the benchmarks.  Cross-path checksums
still match byte-for-byte (hybrid vs legacy vs transposed), now with
the center voxel's index included:
  - synthetic 64M/50%: 0x0000000001e1b860 (was 0x0000000000b2ce4fc)
  - narrowband taperLER: 0x000000001a38de2b (was 0x000000001f8ad71a)

The rough 19/18 ratio shows up in the timings: narrowband slowed ~2-5%
across all three implementations (within the earlier noise floor); the
synthetic transposed pass ticked up more (~13%) but stays within the
already-observed voxel-outer-vs-tap-outer spread.

---

Also documented the end-to-end CPU WENO5 target pipeline in
BatchAccessor.md §11.  This captures the three-phase structure the
experimentation in §5-§8 has been building toward:

  (1) Decode inverse maps (shipped)
  (2) Per-batch sidecar value assembly — the next phase:
      float mValues[Ntaps][W] via sidecar lookup, per W=4/8 batch
      (per-batch scope to keep mValues L1-resident, ~9.5 KB block-wide
      would be too big)
  (3) Full SIMD WENO over Simd<float, W>

Documented the sign-extrapolation convention for out-of-band taps
(magnitude = |sidecar[0]|, sign inherited from the next-inner tap on
the same axis), its loop-order implications (tap-outer / axis-major
/ ascending-|Δ|), and two open questions (distance-1 sign source,
cascade transitivity).

Flagged that the §8l voxel-outer-vs-tap-outer measurements do not
settle the Phase-2 loop order question: they were taken at
BlockWidth=128 inner-loop size over uint64 indices; the real Phase-2
inner loop will be W=4 or W=8 over floats.  Output-layout-match and
sign-extrap dependency both push toward tap-outer at that scale, but
a rerun at the real batch width is required before locking it in.

Plus small 18→19 comment sweeps in the live source; historical
measurement narratives in §8c-§8k left alone (they describe what was
measured at 18 taps at the time and don't update automatically).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
Restructured the `legacy`, `framing`, and `legacy-transposed` passes in
both examples so that after `decodeInverseMaps` the work runs as an outer
`for batchStart in 0..BlockWidth step SIMDw` loop with an inner lane loop
over SIMDw=16.  The StencilAccessor pass was already batched (reference
structure); this aligns every other pass with it.

Notes:
- LegacyStencilAccessor / ReadAccessor instances stay outside the batch
  loop (one per TBB task) — same amortisation as before.
- `legacy-transposed`: `centers[]` and `s[]` scratch arrays shrunk from
  BlockWidth=128 to SIMDw=16; the templated `processTap` lambda's inner
  loop also runs over SIMDw lanes.

This is Stage 1 of the Phase-2 pipeline preparation (BatchAccessor.md
§11): Phase 2 will slot a sidecar-aware value assembly + SIMD WENO
kernel into the same batch loop, so every path needs that shape in place
first.

Perf impact (24-thread, i9-285K):
- StencilAccessor: unchanged (already batched).
- Legacy:          ~5 % slower on narrowband, noise-level on synthetic.
- Legacy-transposed: ~6-7 % slower both workloads — the per-batch
  `processTap` lambda fires 8x instead of once per block.

Checksums match byte-for-byte across all three passes on both workloads.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Stage 2 of the Phase-2 pipeline preparation (BatchAccessor.md §11).
Three new benchmark passes that assemble per-batch value + activity
matrices from the sidecar, plus a stand-in token op for anti-DCE:

  float values  [SIZE][SIMDw]   -- sidecar[idx]     (idx==0 -> background)
  bool  isActive[SIZE][SIMDw]   -- (idx != 0)

Token op: per active voxel, sum values[k][i] over taps with
isActive[k][i]==true, write to a second sidecar at the voxel's
VBM-sequential index (firstOffset + bID*BlockWidth + lane), which
equals the center voxel's ValueOnIndex by VBM construction.

Supporting changes:
- convertToIndexGridWithSidecar: set sidecar[0] = floatGrid.background()
  after copyValues, so sidecar[idx] is unconditionally valid.
- runPerf: takes `const std::vector<float>& sidecar`; allocates a local
  outputSidecar (same shape as input) for the new passes.

Pass variants:
- sidecar-legacy:     LegacyStencilAccessor (scalar moveTo per voxel)
- sidecar-stencil:    StencilAccessor       (hybrid SIMD+scalar moveTo
                      per batch, reads mIndices[k][i] directly)
- sidecar-transposed: tap-outer ReadAccessor probeLeaf + getValue

All three produce identical checksums (cross-validation):
- checksum 0xcfbff7c8 on taperLER.vdb

End-to-end timings (24 threads, i9-285K):
  sidecar-legacy:     130 ms  (4.1 ns/voxel)
  sidecar-stencil:     97 ms  (3.0 ns/voxel)  -- best
  sidecar-transposed: 126 ms  (4.0 ns/voxel)

StencilAccessor's SIMD moveTo amortises both the direction-decode and
the per-lane scalar work; its tap-outer contiguous mIndices[k][i] rows
also let the sidecar gather and value-array fill vectorise cleanly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…polation

New header nanovdb/nanovdb/util/WenoStencil.h defines WenoStencil<W>,
a 19-tap WENO5 stencil value container parameterised on SIMD lane width.
Same storage declaration holds for both scalar (W=1, GPU) and SIMD
(W>1, CPU batched) instantiations via std::conditional_t:

  W == 1 :  ValueT = float           PredT = bool
  W >  1 :  ValueT = float[W]        PredT = bool[W]

Storage is plain C arrays (mValues[SIZE], mIsActive[SIZE]) so callers
fill lane-by-lane with the same syntax in both cases; an internal
addr() helper bridges the W=1 scalar-reference / W>1 array-decay
asymmetry for SIMD load/store inside extrapolate().

extrapolate(absBackground) repairs out-of-band lanes in-place: for each
tap k with mIsActive[k][i] == false, writes copysign(absBackground,
mValues[innerTap][i]) via a hardcoded ascending-|Δ| cascade (18 pairs):

  |Δ|=1 taps  <--  center (0,0,0)
  |Δ|=2 taps  <--  |Δ|=1 on same axis
  |Δ|=3 taps  <--  |Δ|=2 on same axis

Cascade order guarantees the inner tap is already resolved when the
outer tap is processed, so sign propagates through |Δ|=1 -> |Δ|=2 ->
|Δ|=3 without special casing.

Implementation uses only Simd.h primitives (Simd<float,W>, SimdMask,
where, operator>, unary minus, store) which collapse cleanly to scalar
code under W=1 fixed_size<1>.  No Simd.h changes needed.

Also adds a new benchmark pass `sidecar-stencil-extrap` in
ex_narrowband_stencil_cpu: same fill as sidecar-stencil, then calls
stencil.extrapolate(|sidecar[0]|), token sum over ALL 19 taps
(previously sum was gated by isActive).

Results on taperLER.vdb (24 threads, i9-285K):

  Sidecar (stencil)       :  97.3 ms  (3.1 ns/voxel)  checksum=0xcfbff7c8
  Sidecar (stencil+extrap): 101.8 ms  (3.2 ns/voxel)  checksum=0x371273d0

Extrap cost: +4.5 ms / 31.8M voxels = ~0.14 ns/voxel (~50 cycles for
18 SIMD blend pairs).  The checksum for the extrap pass is stable run-to-run;
differs from sidecar-stencil as expected since the sum now includes
extrapolated values instead of excluding out-of-band lanes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
New WenoStencil.md captures the Stage-3 design rationale for
nanovdb/nanovdb/util/WenoStencil.h:

- §2  single-source scalar/SIMD via std::conditional_t storage
      (ValueT = float vs float[W], PredT = bool vs bool[W])
- §3  the addr() bridge for W=1 scalar-ref vs W>1 array-decay at
      SIMD load/store sites; rejected alternative (Simd.h reference
      overloads) and why
- §4  extrapolation semantics (OOB problem, ascending-|Δ| cascade,
      hardcoded kPairs table for Weno5)
- §5  extrapolate() implementation walkthrough + per-pair cost model
- §6  API usage (fill / extrapolate / named-tap access)
- §7  future work: reconstruct() method, Weno5Stencil policy
      consolidation, generic StencilPolicy parameterisation
- §8  relation to BatchAccessor.md / StencilAccessor.md / HaloStencilAccessor.md

BatchAccessor.md §11.6 (Implementation status) added to the
"Target pipeline" section summarising Stage 1 (batch-restructure),
Stage 2 (sidecar variants), and Stage 3 (WenoStencil + extrapolate)
with commit SHAs, perf tables, and a trimmed "Remaining" list
pointing at the Phase-3 WENO5 arithmetic kernel as the next step.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ffers

Refactor WenoStencil<W> to hold first-class Simd values directly, rather
than raw float[W]/bool[W] C arrays with an addr() bridge at every SIMD
load/store site.  The class becomes pure-compute: no fill-side storage,
no addr() ceremony, no element_aligned tags inside the kernel bodies.
The caller owns whatever scalar-scatter target shape is natural for its
source (typically a pair of stack-local alignas(64) C arrays at CPU
W>1; no intermediate buffer at all on CUDA W=1).

  FloatV values  [SIZE];      // Simd<float, W>     — first-class storage
  MaskV  isActive[SIZE];      // SimdMask<float, W>
  float  mDx2, mInvDx2;       // scalar grid constants, broadcast on use

At W=1 the Simd types collapse to plain float/bool under the array
backend — the CUDA per-thread code path reads as pure scalar
arithmetic, which was the primary motivation for the refactor:
scalar is the production workhorse on GPU, not a degenerate
convenience of the CPU SIMD design.

The earlier Approach-1 design (std::conditional_t<W==1, float, float[W]>
+ addr() helper) forced W=1 to load `Simd<float, 1>` via address-of-
scalar, which propagated SIMD-ish ceremony into every __hostdev__
method.  Approach 2 sidesteps this: at W=1 there is no Simd wrapper at
the storage level at all.

Also adds `normSqGrad(iso = 0.f)`, the fifth-order upwind Godunov
norm-square-gradient method.  Line-for-line transliteration of the
scalar ground-truth `WenoStencil<GridT>::normSqGrad(isoValue)` in
`nanovdb/math/Stencils.h`, adapted to generic-T:

  - Six axial WENO5 reconstructions via free-function
    `nanovdb::detail::WENO5<T>` — mirrors `nanovdb::math::WENO5<T>`
    identically, wrapping integer literals in RealT() for Simd
    broadcast.
  - `nanovdb::detail::GodunovsNormSqrd<T, MaskT>` computes both
    outside/inside branches unconditionally and blends via
    util::where; replaces the scalar ground-truth's runtime if/else
    so the SIMD path has no control-flow divergence across lanes.
  - Final scalar mInvDx2 broadcast to FloatV at the single
    combinator multiplication (free on x86; identity at W=1).
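A scalar sketch of the branchless Godunov upwinding described in the second bullet: both the outside and inside candidates are computed unconditionally, then blended on the sign predicate (the scalar analogue of util::where; the function name and one-axis framing here are assumptions for illustration):

```cpp
#include <algorithm>
#include <cassert>

// One-axis Godunov upwind contribution to |∇φ|².  dM/dP are the
// one-sided minus/plus derivatives; `outside` is the φ > iso predicate.
// Both branches are evaluated so a SIMD version has no lane divergence.
inline float godunovAxisSq(float dM, float dP, bool outside)
{
    const float z = 0.f;
    float out = std::max(std::max(dM, z) * std::max(dM, z),
                         std::min(dP, z) * std::min(dP, z));
    float in  = std::max(std::min(dM, z) * std::min(dM, z),
                         std::max(dP, z) * std::max(dP, z));
    return outside ? out : in;   // where(outside, out, in) in SIMD form
}
```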

sidecar-stencil-extrap pass in ex_narrowband_stencil_cpu updated to
the new API:

  - Caller allocates stack-local raw_values[SIZE][SIMDw] /
    raw_active[SIZE][SIMDw] inside the TBB task body.
  - Scalar scatter fill populates the raw buffers from the sidecar
    via stencilAcc.mIndices[k][i].
  - SIMD load-per-tap moves the data into stencil.values[k] /
    stencil.isActive[k].
  - stencil.extrapolate(absBackground) operates in-place on the
    Simd values.
  - The token sum over 19 taps now runs as a pure Simd accumulate,
    then stores to a stack-local sum_lanes[SIMDw] for the per-voxel
    scalar write to the output sidecar (gated by leafIndex).

Checksum matches byte-for-byte (0x371273d0 on taperLER.vdb), timing
within noise of the pre-refactor baseline (~82 ms / 2.6 ns/voxel
end-to-end on 24 threads).

WenoStencil.md rewritten: §2 now describes Simd-typed storage and
the caller-owned fill-buffer convention (the addr() bridge section
is removed); §4 documents normSqGrad including the generic-T
WENO5 / GodunovsNormSqrd helpers and their relationship to the
scalar ground-truth; §5.1/§5.2 show the two usage patterns (CPU
SIMD with raw buffers + explicit load; CUDA scalar with direct
per-tap assignment); §6 diagrams the ownership boundaries; §7.1
flags that normSqGrad perf hasn't been measured yet.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
…ne pass

Adds the full Phase-2+3 WENO5 pipeline as a new benchmark pass:
decode → sidecar fill → load into WenoStencil<W> → extrapolate →
normSqGrad → store |∇φ|² to output sidecar.

The pass replaces the debug "sum over 19 taps" intermediate of
sidecar-stencil-extrap with the real Phase-3 arithmetic — what goes
into the output channel is now the actual norm-square-of-gradient,
not an anti-DCE token.  Existing sidecar-stencil-extrap kept as a
perf-decomposition waypoint that isolates the cost of load+extrapolate
alone.

Grid voxel size comes from grid->voxelSize()[0] (isotropic assumption
for narrow-band level sets); iso = 0 for zero-crossing semantics.

Measurements on taperLER.vdb (31.8M active voxels, 24 threads, i9-285K):

  Sidecar (stencil)         :  83.6 ms  (2.6 ns/voxel)  fill + debug sum
  Sidecar (stencil+extrap)  :  78.5 ms  (2.5 ns/voxel)  + extrapolate
  Sidecar (+normSqGrad)     :  97.3 ms  (3.1 ns/voxel)  + full WENO5 + Godunov

Phase-3 arithmetic cost isolated:  18.8 ms / 31.8M vox = 0.59 ns/voxel
end-to-end on 24 threads.  Per-core per-voxel: ~14 ns, roughly matching
the expected ~100-FMA WENO5+Godunov budget at AVX2 throughput.

Full Phase-2+3 end-to-end over decode:  90.7 ms = 2.85 ns/voxel.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
… arithmetic

Assembly analysis of the sidecar-stencil-normsqgrad pass revealed that
GCC was outlining tiny one-instruction wrappers despite their `inline`
declaration.  The cost model flagged them "too heavy" because the
calling convention passes 256-bit YMM arguments by value, regardless of
how trivial the body actually is.  Result:

  - util::min / util::max / util::where on Simd<float,16> emitted as
    weak symbols with 6-register prologue/epilogue + vzeroupper per
    call.  GodunovsNormSqrd made 20 such calls per voxel batch.
  - detail::WENO5 and detail::GodunovsNormSqrd also emitted as weak
    out-of-line, called 6+1 times from WenoStencil<16>::normSqGrad.
  - WenoStencil<16>::normSqGrad and ::extrapolate also out-of-line,
    called from the pass lambda body.

Each outlining layer forced a vzeroupper (AVX→SSE state transition
required by the x86 ABI when the callee is not marked VZEROUPPER-aware).
Per 16-voxel batch the normSqGrad path incurred ~26 vzeroupper + 8
function calls; over 2M batches × 24 threads this dominated what should
have been a tight inlined SIMD block.

Fix: [[gnu::always_inline]] on eight sites.

  Simd.h — the stdx-backend wrappers at namespace nanovdb::util:
    - min<T,W>(Simd, Simd)
    - max<T,W>(Simd, Simd)
    - where<T,W>(SimdMask, Simd, Simd)   (homogeneous)
    - where<T,U,W>(SimdMask<U>, Simd<T>, Simd<T>)  (heterogeneous)

  WenoStencil.h:
    - detail::WENO5<T,RealT>
    - detail::GodunovsNormSqrd<T,MaskT>
    - WenoStencil<W>::extrapolate
    - WenoStencil<W>::normSqGrad
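The attribute pattern looks like the following (a generic illustration, not the actual Simd.h signatures): [[gnu::always_inline]] overrides GCC's inlining cost model, which otherwise treats by-value YMM parameters as heavy and outlines even one-instruction bodies behind a call plus vzeroupper.

```cpp
#include <cassert>

// Hypothetical one-instruction wrapper of the kind GCC was outlining;
// the attribute forces inlining regardless of the cost model's estimate.
template <typename T>
[[gnu::always_inline]] inline T myMin(T a, T b) { return a < b ? a : b; }
```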

With the attribute applied, all four weak symbols (WENO5, Godunov,
extrapolate, normSqGrad) disappear from the object file.  The pass #8
(sidecar-stencil-normsqgrad) body grows from 1693 B to 9269 B as the
full Phase-2+3 pipeline collapses into one inlined block.  Hot-path
instruction mix (stdx backend, -O3 -march=native, AVX2+FMA3, Arrow Lake):

  518 packed SIMD arithmetic ops:
    218 vmulps + 128 vaddps + 72 vsubps + 48 vdivps + 24 vmaxps
    160 FMAs: 71 vfmadd231ps + 36 vfmadd132ps + 29 vfnmadd132ps
              + 24 vfnmadd231ps
  28 vbroadcastss (scalar constants to YMM)
  1160 / 2044 instructions use ymm (57 %)
  3 calls remaining: decodeInverseMaps, StencilAccessor::moveTo,
                     __stack_chk_fail.
  3 vzeroupper total (down from ~26 scattered across the outlined call
    tree).

Measured on taperLER.vdb (31.8 M active voxels, 24 threads, i9-285K):

  Before:  Sidecar (+normSqGrad) = 97.3 ms = 3.1 ns/voxel
  After:   Sidecar (+normSqGrad) = 82.4 ms = 2.6 ns/voxel
  Delta:   −15 % end-to-end.

Isolating the Phase-3 arithmetic alone (normSqGrad pass minus
sidecar-stencil-extrap pass, which stops at the extrap step):

  Before:  18.8 ms / 31.8 M vox = 0.59 ns/voxel
  After:    2.7 ms / 31.8 M vox = 0.09 ns/voxel
  Delta:   −86 %, i.e. 7× faster on the WENO5 + Godunov arithmetic.

Checksum changed 0x4371e374 → 0x438f725f.  Expected: with the helpers
inlined at a single source location, GCC fuses more mul+add pairs into
FMAs, giving slightly different FP-rounding behaviour.  Both outputs
are equally correct; the difference is non-deterministic FP-arithmetic
artifacts of instruction scheduling, not a semantic change.

No regressions elsewhere: every other pass (StencilAccessor, Legacy,
sidecar-{legacy,stencil,transposed}, legacy-transposed) stayed within
~2 % of its previous timing on this run.

Background: this is the analogue for Phase-3 of the fix documented in
BatchAccessor.md §8h for the StencilAccessor::moveTo path, where
[[gnu::flatten]] on the moveTo() call site served the same purpose
(force-inlining Simd.h helpers that GCC outlines despite `inline`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
…lidation

New standalone example that exercises the complete CPU WENO5 pipeline
on a narrow-band level set loaded from a .vdb file and validates the
result against a scalar reference:

  reference : per-voxel scalar nanovdb::math::WenoStencil<NanoGrid<float>>
              normSqGrad, driven by a TBB-parallel leaf walk.  Tile values
              at internal nodes and in-leaf inactive values are preserved
              verbatim during the openvdb::FloatGrid -> NanoGrid<float>
              conversion, so the stencil's ReadAccessor fetches correctly-
              signed values for taps outside the narrow band (the same
              convention OpenVDB's narrow-band level-set builder sets up,
              which matches the semantics of our explicit extrapolate()).

  fast      : VBM decode -> LegacyStencilAccessor scalar gather ->
              per-tap SIMD load into WenoStencil<SIMDw> -> extrapolate()
              -> normSqGrad() -> Simd->scalar bridge -> per-voxel scalar
              write to outputFast.  No hybrid SIMD StencilAccessor;
              voxel-outer LegacyStencilAccessor chosen for code clarity.

Cross-path indexing: both passes write to a sidecar-shaped output
buffer keyed by ValueOnIndex slot (slot 0 = background, slots 1..N =
active voxels in NanoVDB ordering).  The index grid's accessor resolves
`ijk -> slot` in the reference pass; VBM's firstOffset + bID*BlockWidth
+ lane gives the same slot in the fast pass, by VBM construction.
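The fast-path slot arithmetic can be sketched as follows (the block width of 512 and the parameter names are assumptions mirroring the commit's description, not the library's actual API):

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t kBlockWidth = 512; // assumed voxels per VBM block

// Slot 0 is the background; active voxels occupy slots 1..N in NanoVDB
// ordering, so the fast path indexes the sidecar linearly per block
// while the reference pass resolves the same slot via ijk lookup.
inline std::uint64_t vbmSlot(std::uint64_t firstOffset,
                             std::uint64_t bID, std::uint64_t lane)
{
    return firstOffset + bID * kBlockWidth + lane;
}
```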

After both passes, prints a log-decade histogram of |outputRef[i] -
outputFast[i]| across all active voxels, plus max/mean deltas.  On
taperLER.vdb (31.8M active voxels, 24 threads, i9-285K):

  reference (scalar): 120.4 ms  (3.80 ns/voxel)
  fast  (VBM+SIMD)  :  91.3 ms  (2.87 ns/voxel)   speedup: 1.3x

  |Delta| histogram:
    [0,      1e-10) : 23,431,571 (73.75%)    bit-exact match
    [1e-10,  1e-9 ) :          2 ( 0.00%)
    [1e-9,   1e-8 ) :          2 ( 0.00%)
    [1e-8,   1e-7 ) :  1,122,483 ( 3.53%)    single-ULP-class noise
    [1e-7,   1e-6 ) :  7,218,133 (22.72%)    FMA-fusion-class noise
    [1e-6,   inf  ) :          0 ( 0.00%)

  max  |Delta| = 9.5e-7  (at slot 12,390,855:
                          ref=2.28233, fast=2.28234)
  mean |Delta| = 4.1e-8
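The log-decade binning behind that histogram reduces to a small edge scan; a sketch (bin edges taken from the printed table, function name hypothetical):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Map |outputRef[i] - outputFast[i]| to its log-decade bin:
// 0 = [0,1e-10), 1 = [1e-10,1e-9), ..., 5 = [1e-6, inf).
inline std::size_t deltaBin(float delta)
{
    const float edges[] = {1e-10f, 1e-9f, 1e-8f, 1e-7f, 1e-6f};
    const float a = std::fabs(delta);
    std::size_t b = 0;
    for (float e : edges) { if (a < e) break; ++b; }
    return b;
}
```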

No voxels disagree above 1e-6.  Our explicit extrapolate() plus SIMD
normSqGrad reproduces the scalar OpenVDB/NanoVDB tile-extrapolation
ground truth to within FP-rounding tolerance — the sign-cascade rule
(kPairs table in WenoStencil.h) is consistent with OpenVDB's
topological tile-value convention on typical narrow-band SDFs.

End-to-end 1.3x speedup reflects that both paths pay a roughly equal
sidecar-gather cost; the 7x Phase-3-arithmetic speedup documented in
BatchAccessor.md §11.6 shows up as only a fraction of total time
because gather dominates (60% of fast-path wall clock) and is still
scalar-scatter-shaped in both paths.  A SIMD-gather front-end (the
hybrid StencilAccessor path) would widen this, at the cost of the
library complexity the clean demonstrator deliberately avoids.

Library dependencies exercised by the fast path:
  - LegacyStencilAccessor.h     (per-voxel scalar gather)
  - WenoStencil.h               (Simd compute container)
  - Simd.h                      ([[gnu::always_inline]] helpers)
  - branchless LeafData::getValue (default body of getValue)
  - Weno5Stencil::Taps          (tap tuple; currently in StencilAccessor.h)

Not exercised — candidates for demotion or tighter doc-scoping:
  - StencilAccessor.h + BatchAccessor.h (hybrid SIMD gather path)
  - HaloStencilAccessor.md      (speculative halo gather)

CMakeLists.txt hook mirrors ex_narrowband_stencil_cpu: OPENVDB-gated,
-march=native -fopenmp-simd, no CUDA dependency.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>
.#weno_nanovdb_cpp.cpp — rather, .#weno_nanovdb_cpu.cpp — is an Emacs
per-user lock artifact that got swept up by `git add` in the previous
commit.  It's a dangling symlink to esifakis@host:pid, not a meaningful
repo artifact.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Efty Sifakis <esifakis@nvidia.com>